Getting Started

Hi, and welcome to the intro to search tutorial! This is gonna be fun. Let's start.

Information Retrieval

We won't dwell on this much, but the people who study how to search stuff with computers call their field information retrieval. The term was coined by Calvin Mooers in 1950, in a paper about a punch-card based retrieval system, and it has been used in the study of search ever since. (That's about the same time computers started being used for searching.)

We'll introduce some of the standard terminology of information retrieval so you're familiar with it, but don't get too hung up on it. (Here's the first term! Information Retrieval, or IR for short. More on Wikipedia ;)

Documents

Documents are the building blocks of any search engine: they're what you're searching. They can be documentation pages, books, chapters of books, or other semi-structured text. The goal of search is to let you run queries over the documents to find what you're looking for. Each document can have multiple pieces of data associated with it, and those pieces can have different types (text and datetimes, for example). The definition of the different pieces of data that make up a document is called a schema.
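
For example, here's roughly what a whoosh schema with a couple of different field types might look like (whoosh is the library we'll use below; the field names here are made up for illustration):

from whoosh.fields import Schema, ID, TEXT, DATETIME
example_schema = Schema(id=ID(unique=True),  # a unique identifier for each document
                        content=TEXT,        # free text to search
                        created=DATETIME)    # when the document was created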

There are two main problems to solve when it comes to search: how to index documents, and how to query them. First, we'll talk about indexing.

Indexing

Sometimes the data we're searching is small enough that we can scan all of it for matches every time; in that case, a regular expression will do the job. But we're not talking about regexes today. We're talking about the case when we want to search a larger amount of data, really fast, and scanning the entire dataset on every query is too slow. That's where a different set of techniques comes in, starting with indexing.

Indexing is the process of taking documents, extracting the relevant terms from them (for text, the terms are usually words), and creating a data structure that can then be used to quickly find the documents again. Here we're going to index a very small set of documents and look at the data structure that's produced. Later, we'll form queries against that data structure to help us find the documents we're looking for.


In [ ]:
# We're using the awesome-tastic search library `whoosh`. Give Matt Chaput a hug if you see him!
import whoosh

In [1]:
# you can run this on the command line with `python index_small_example.py`
%run index_small_example.py


updating existing index in 'ex_index'
processing document 0; document contents: Ce matin j'ai bu un café.
processing document 1; document contents: I've been emailing all day long.
processing document 2; document contents: http://www.python.org/
processing document 3; document contents: I love Python!
processing document 4; document contents: I never want to send another email ever again.
processing document 5; document contents: Email email email. Sooo many emails.
done!

In [25]:
# OK, so what does the index look like? By default, whoosh creates the index in a directory structure on disk, though it can also
# create temporary indexes in memory. Here's the index we created:
! file ex_index/*


ex_index/MAIN_WRITELOCK:            empty
ex_index/MAIN_f9t1kmj19xfij2l1.seg: data
ex_index/_MAIN_1.toc:               data

Looks like the index that Whoosh generates for us is a bunch of binary data.

_MAIN_1.toc is the TableOfContents file, which basically consists of some versioning info (so whoosh can tell if the format has changed), a pickled version of the index's schema, and a list of the index's segments. For our example, the schema looks like this:

schema = Schema(id=ID(unique=True), content=TEXT)

Pretty simple, huh?

Segments are the way that whoosh breaks up the meat of its index in order to make its operations fast. Here's what the whoosh source has to say about them:

  Having multiple segments allows quick incremental indexing: just create a
  new segment for the new documents, and have the index overlay the new
  segment over previous ones for purposes of reading/search. "Optimizing"
  the index combines the contents of existing segments into one (removing
  any deleted documents along the way).

(More info about how fast Whoosh is)

But what exactly is stored inside of each segment, and how does that help us search documents?


In [4]:
# OK, let's get set up to take a look at the contents of the index!
from whoosh.index import open_dir
ix = open_dir('ex_index')
print "{} documents in index `ix`".format(ix.doc_count())
r = ix.reader()


5 documents in index `ix`

In [28]:
# Now we can play around with the contents of the index! Here are some suggestions:
print list(r.all_doc_ids())
print list(r.all_terms())
print list(r.postings('content', 'email').all_ids())

from whoosh import formats
def print_postings(r):
    # Each posting is (field name, term text, document number, weight, encoded value).
    for fieldname, text, docnum, weight, valuestring in r.iter_postings():
        print fieldname, text, docnum, weight,
        if valuestring is not None:
            # The encoded value stores term positions; decode it as a frequency
            # and as a list of positions.
            print formats.Positions().decode_as('frequency', valuestring),
            print formats.Positions().decode_as('positions', valuestring),
        print

print_postings(r)

# Full API at http://pythonhosted.org/Whoosh/api/reading.html


[0, 1, 2, 3, 4]
[('content', 'again'), ('content', 'another'), ('content', 'brown'), ('content', 'dog'), ('content', 'email'), ('content', 'emails'), ('content', 'ever'), ('content', 'fox'), ('content', 'http'), ('content', 'jumps'), ('content', 'lazy'), ('content', 'love'), ('content', 'many'), ('content', 'never'), ('content', 'over'), ('content', 'python'), ('content', 'quick'), ('content', 'send'), ('content', 'sooo'), ('content', 'want'), ('content', 'www.python.org'), ('id', '0'), ('id', '1'), ('id', '2'), ('id', '3'), ('id', '4')]
[3L, 4L]
content again 3 1.0 1 [7]
content another 3 1.0 1 [4]
content brown 0 1.0 1 [2]
content dog 0 1.0 1 [7]
content email 3 1.0 1 [5]
content email 4 3.0 3 [0, 1, 2]
content emails 4 1.0 1 [5]
content ever 3 1.0 1 [6]
content fox 0 1.0 1 [3]
content http 2 1.0 1 [0]
content jumps 0 1.0 1 [4]
content lazy 0 1.0 1 [6]
content love 1 1.0 1 [1]
content many 4 1.0 1 [4]
content never 3 1.0 1 [1]
content over 0 1.0 1 [5]
content python 1 1.0 1 [2]
content quick 0 1.0 1 [1]
content send 3 1.0 1 [3]
content sooo 4 1.0 1 [3]
content want 3 1.0 1 [2]
content www.python.org 2 1.0 1 [1]
id 0 0 1.0
id 1 1 1.0
id 2 2 1.0
id 3 3 1.0
id 4 4 1.0

Postings

So ids are the document IDs and terms are the words from the documents; that's pretty self-explanatory. But what's a posting? In the language of IR, a posting really just means a document ID plus any other information associated with it, and a posting list is a list of postings, mapped to a term in the index. As you can see above, whoosh includes a few extra pieces of information in each posting (a weight and an encoded value holding term positions).

But we haven't really talked about indexes yet. You can think of a search index the same way you think of the index in the back of a reference book. (They are the same word, after all!) In IR literature, the data structure is actually called an inverted index. So what's inverted about it? It turns out that a mapping from documents to words is considered a forward index, and a mapping from words to documents an inverted index. So that's why it's inverted. (More here.)

When creating an inverted index for a set of documents, whoosh does some processing that creates something similar to a book index for those documents. For text, that means breaking the text up into possible search terms, which are usually words, and associating each word with the documents it appears in, along with some other information, such as the frequency of the word in a given document (more on that later). When someone searches, whoosh can then combine the words they searched for with the information in the index and fetch the most relevant documents.
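
To make that concrete, here's a toy inverted index in plain Python. This is just the shape of the idea, not whoosh's actual implementation:

docs = {0: "I love Python", 1: "Email email email"}
inverted = {}  # maps each word to a list of (document ID, position) postings
for doc_id, text in docs.items():
    for position, word in enumerate(text.lower().split()):
        inverted.setdefault(word, []).append((doc_id, position))
print inverted
# e.g. {'email': [(1, 0), (1, 1), (1, 2)], 'i': [(0, 0)], 'love': [(0, 1)], 'python': [(0, 2)]}

To answer a query for "email", we just look up 'email' in the dict and get its posting list back, no scanning required.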

We don't really need to think about postings to use whoosh at a high level, but it's useful to be familiar with the terminology, as it's part of the vocabulary of information retrieval.

Tasks

  • Modify the simple schema to incorporate a new data type. How does this change the posting list?
  • What's the weight term in a posting? Where does it come from?
  • Stemming is a method for making searches for e.g. "email" also match "emails" and "emailing". Read about it here and modify the schema to support stemming. (There's a starting-point sketch after this list.)
  • Do the same for accent folding! (Same documentation.)
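
If you want something to check your work against, here's a sketch covering both of the last two tasks. StemmingAnalyzer, CharsetFilter, and accent_map are all part of whoosh; see the linked documentation for the details:

from whoosh.analysis import StemmingAnalyzer, CharsetFilter
from whoosh.fields import Schema, ID, TEXT
from whoosh.support.charset import accent_map

# StemmingAnalyzer stems words at index time, so "emails" indexes as "email";
# chaining a CharsetFilter with accent_map folds accented characters, so the
# "café" in document 0 becomes searchable as "cafe".
analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)
folded_schema = Schema(id=ID(unique=True), content=TEXT(analyzer=analyzer))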

Querying

OK, now that we know about indexes, let's introduce a larger data set so we can search it. This tutorial comes with a set of data generated from a subset of Project Gutenberg. You should have pre-generated metadata and a search index for this data as part of preparing for this tutorial. If you haven't yet, do it now! It takes a few minutes to run.


In [ ]:
%run extract_metadata.py
%run index_gutentexts.py

In [23]:
from whoosh.qparser import QueryParser
gutenindex = open_dir("gutenindex")
qp = QueryParser("content", schema=gutenindex.schema)
q = qp.parse("women Boston")
with gutenindex.searcher() as s:
    results = s.search(q)
    print results
    print len(results)


<Top 10 Results for And([Term('content', u'women'), Term('content', u'boston')]) runtime=0.0105450153351>
78

You can see that when whoosh parsed the text query "women Boston" into its query representation, it split the query into the two individual words and combined them with a boolean AND operator. [Here's a summary of whoosh's default query language: http://pythonhosted.org/Whoosh/querylang.html]
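
That AND-everything default is configurable, by the way. Here's a sketch of parsing with OR instead (OrGroup comes from whoosh's qparser module):

from whoosh.qparser import QueryParser, OrGroup
qp_or = QueryParser("content", schema=gutenindex.schema, group=OrGroup)
print qp_or.parse("women Boston")  # Or([...]) instead of And([...])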

Relevance

Whoosh returns search results sorted by relevance, most relevant first. What does it mean by relevance? One signal: when the same word from the query appears multiple times in a given document, whoosh weights that document more heavily in the results. (In IR literature, this is called term frequency weighting; whoosh's default scoring algorithm, BM25F, builds on it.)

Whoosh can also give different weightings to different fields, so e.g. it makes a bigger difference if a book's title matches the query than if the book text does.
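
Here's a sketch of what that might look like in a schema. field_boost is a whoosh field option; the field names are just for illustration:

from whoosh.fields import Schema, ID, TEXT
boosted_schema = Schema(id=ID(unique=True),
                        title=TEXT(field_boost=2.0),  # title matches count double
                        content=TEXT)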

Tasks

  • Play around with the query language. How can I search for...

    • a specific title?
    • a specific title containing multiple words?
    • wildcards?
    • an exact phrase?
  • Modify the query program to search all fields by default (title/author/content/etc.).

  • Modify the query program to give much higher weight to titles and authors than to the text of the book.

  • Can you modify the query language to support using boolean operators in a different language? This might help.

Advanced Topics

This is all on you. Here are some additional things to learn if you get this far. If you want some pointers or something explained, raise your hand and we'll help you out.

Sorting and faceting

http://pythonhosted.org/Whoosh/facets.html
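
For instance, here's a sketch of sorting results by a field instead of by score. FieldFacet is part of whoosh.sorting, and this assumes the gutenindex schema has a title field (as the tasks above suggest):

from whoosh.sorting import FieldFacet
with gutenindex.searcher() as s:
    # Sort matching books alphabetically by title rather than by relevance.
    results = s.search(qp.parse("women Boston"), sortedby=FieldFacet("title"))
    print results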

Spelling Correction

Modify query_webapp.py to give misspelling suggestions if no results are found for a given search query.

http://pythonhosted.org/Whoosh/spelling.html
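
A sketch of the "did you mean" flow to start from. correct_query is whoosh's API for this; the misspelled query string is made up:

with gutenindex.searcher() as s:
    qstring = "wimen Bostin"  # a hypothetical user typo
    q = qp.parse(qstring)
    corrected = s.correct_query(q, qstring)
    if corrected.query != q:
        print "Did you mean: {}?".format(corrected.string)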

Highlighting

Modify query_webapp.py to serve up results with the search terms highlighted.

http://pythonhosted.org/Whoosh/highlight.html
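
A sketch of the searcher side to start from. hit.highlights() is whoosh's API for this; note that it needs the text of the field, so either mark the field as stored in the schema or pass the original text in via its text= argument:

with gutenindex.searcher() as s:
    for hit in s.search(qp.parse("women Boston")):
        print hit.highlights("content")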

N-grams

http://pythonhosted.org/Whoosh/ngrams.html

You may want to start with the smaller index example to play around with N-grams, since the full text of a book is not exactly a great use case for search-as-you-type.
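
For example, here's a sketch of a schema built for that kind of matching. NGRAM is a whoosh field type; the sizes here are arbitrary:

from whoosh.fields import Schema, ID, NGRAM
# An NGRAM field indexes overlapping letter n-grams (here 2 to 4 characters long),
# so a partially-typed query like "ema" can match documents containing "email".
ngram_schema = Schema(id=ID(unique=True), content=NGRAM(minsize=2, maxsize=4))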


In [17]:
# Term statistics from the small example index: the total frequency of a term
# across all documents, and every indexed term starting with a given prefix.
print r.frequency('content', 'email')
print list(r.expand_prefix('content', 'e'))


4.0
['email', 'emails', 'ever']

Further Reading

Learning

Kragen Sitaker is writing a tiny search engine in pure Python for the next edition of The Architecture of Open Source Applications book; it's a super great read if you're interested in the nitty-gritty of the basic data structures behind indexing:

https://github.com/kragen/500lines/tree/master/search-engine

His "Further information" section contains a bunch of great resources for further study as well.

If you like math, you might find Xapian's introduction to IR theory interesting.

The PageRank paper is a classic modern information retrieval read.

Building

If you end up needing to build fast search over a really, really large data set, you're going to need a more heavyweight tool than whoosh. ElasticSearch is a popular, well-supported option that can scale out across a cluster.